(adapted from T. Kwartler's excellent course Text Mining: Bag of Words on DataCamp.com)
In this Notebook, you will be introduced to the basic notions of Text Mining using the tm and qdap R libraries. For most cells, the descriptions/lead-ups can be found in the commented lines (in green; comments start with the # character). In some instances, explanations are provided separately.
The main dataset you will work with is the text content of Associated Press game recaps involving the Ottawa Senators during the 2016-2017 NHL season.
library('tm') # R text mining library
library('qdap') # R quantitative discourse analysis package
new_text <- "The Ottawa Senators have the Atlantic Division lead in their sights. Mark Stone had a goal and four assists, Derick Brassard scored twice in the third period and the Senators recovered after blowing a two-goal lead to beat the Toronto Maple Leafs 6-3 on Saturday night. The Senators pulled within two points of Montreal for first place in the Atlantic Division with three games in hand. We like where we're at. We're in a good spot, Stone said. But there's a little bit more that we want. Obviously, there's teams coming and we want to try and create separation, so the only way to do that is keep winning hockey games. Ottawa led 2-0 after one period but trailed 3-2 in the third before getting a tying goal from Mike Hoffman and a power-play goal from Brassard. Stone and Brassard added empty-netters, and Chris Wideman and Ryan Dzingel also scored for the Senators. Ottawa has won four of five overall and three of four against the Leafs this season. Craig Anderson stopped 34 shots. Morgan Rielly, Nazem Kadri and William Nylander scored and Auston Matthews had two assists for the Maple Leafs. Frederik Andersen allowed four goals on 40 shots. Toronto has lost eight of 11 and entered the night with a tenuous grip on the final wild-card spot in the Eastern Conference. The reality is we're all big boys, we can read the standings. You've got to win hockey games, Babcock said. After Nylander made it 3-2 with a power-play goal 2:04 into the third, Hoffman tied it by rifling a shot from the right faceoff circle off the post and in. On a power play 54 seconds later, Andersen stopped Erik Karlsson's point shot, but Brassard jumped on the rebound and put it in for a 4-3 lead. Wideman started the scoring in the first, firing a point shot through traffic moments after Stone beat Nikita Zaitsev for a puck behind the Leafs goal. Dzingel added to the lead when he deflected Marc Methot's point shot 20 seconds later. 
Andersen stopped three shots during a lengthy 5-on-3 during the second period, and the Leafs got on the board about three minutes later. Rielly scored with 5:22 left in the second by chasing down a wide shot from Matthews, carrying it to the point and shooting through a crowd in front. About three minutes later, Zaitsev fired a shot from the right point that sneaked through Anderson's pads and slid behind the net. Kadri chased it down and banked it off Dzingel's helmet and in for his 24th goal of the season. Dzingel had fallen in the crease trying to prevent Kadri from stuffing the rebound in. Our game plan didn't change for the third period, and that's just the maturity we're gaining over time, Senators coach Guy Boucher said. Our leaders have been doing a great job, but collectively, the team has grown dramatically in terms of having poise, executing under pressure. Game notes : Mitch Marner sat out for Toronto with an upper-body injury. Marner leads Toronto with 48 points and is also expected to sit Sunday night against Carolina."
# Print new_text to the console
new_text
# Find the 20 most frequent terms: term_count
term_count <- freq_terms(new_text,20)
# Plot term_count
plot(term_count)
# Import text data
recaps <- read.csv(file="Data/Recap_data.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
# View the structure of recaps
str(recaps)
# Print out the number of rows in recaps
nrow(recaps)
# Isolate text from recaps: AP.recaps
AP.recaps <- recaps$AP_Recap
# Show first 6 recaps
head(AP.recaps)
There are odd characters in the game recaps (\xd1, for instance), which point to an issue with text encoding and formatting. Let's revisit the last few steps with a slightly different data file.
# Import text data
recaps <- read.csv(file="Data/Recap_data_first_pass.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
# View the structure of recaps
str(recaps)
# Print out the number of rows in recaps
nrow(recaps)
# Isolate text from recaps: AP.recaps
AP.recaps <- recaps$AP.Recap
# Show first 6 recaps
head(AP.recaps)
For reasons that are perhaps too technical to get into at this point, the encoding of Recap_data_first_pass.csv creates issues for tm and qdap down the road; the issues disappear when we use a different encoding (UTF-8).
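If you ever need to produce such a UTF-8 file yourself, base R's iconv() can do the conversion. A minimal sketch, assuming the source file is Latin-1 encoded (the actual source encoding would need to be checked):

```r
# Read the raw lines, declaring the (assumed) source encoding
raw_lines <- readLines("Data/Recap_data_first_pass.csv", encoding = "latin1")
# Convert to UTF-8 and write the result back out
writeLines(iconv(raw_lines, from = "latin1", to = "UTF-8"),
           "Data/Recap_data_first_pass_utf8.csv")
```

Alternatively, read.csv() accepts a fileEncoding argument that re-encodes while reading, which avoids creating a second file.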
# Import text data
recaps <- read.csv(file="Data/Recap_data_first_pass_utf8.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
# Isolate text from recaps: AP.recaps
AP.recaps <- recaps$AP.Recap
# Make a vector source: AP.recaps.source
AP.recaps.source <- VectorSource(AP.recaps)
# Make a volatile corpus: AP.recaps.corpus
AP.recaps.corpus <- VCorpus(AP.recaps.source)
# Print out AP.recaps.corpus
AP.recaps.corpus
# Print data on the 15th recap in AP.recaps.corpus
AP.recaps.corpus[[15]]
# Print the content of the 15th recap in AP.recaps.corpus
AP.recaps.corpus[[15]][1]
# Print the meta of the 15th recap in AP.recaps.corpus
AP.recaps.corpus[[15]][2]
We can also take a look at some basic statistics regarding the number of characters and the number of words in the game recaps.
# Statistics on the recaps' number of characters
length_of_recaps_char <- vector(mode="numeric", length=nrow(recaps))
for(j in 1:nrow(recaps)){length_of_recaps_char[j] <- nchar(content(AP.recaps.corpus[[j]]))}
hist(length_of_recaps_char, freq=F, main="Distribution of # of characters in Senators game recaps (16-17)")
summary(length_of_recaps_char)
# Statistics on the recaps' number of words
length_of_recaps_word <- vector(mode="numeric", length=nrow(recaps))
for(j in 1:nrow(recaps)){length_of_recaps_word[j] <- length(strsplit(gsub(' {2,}',' ',content(AP.recaps.corpus[[j]])),' ')[[1]])}
hist(length_of_recaps_word, freq=F, main="Distribution of # of words in Senators game recaps (16-17)")
summary(length_of_recaps_word)
# Create the object: text
text <- "<i>He</i> went to bed at 2 A.M. It\'s way too late! He was only 20% asleep at first, but sleep eventually came."
text
# All lowercase
tolower(text)
# Remove punctuation
removePunctuation(text)
# Remove numbers
removeNumbers(text)
# Remove whitespace
stripWhitespace(text)
# Remove text within brackets
bracketX(text)
# Replace numbers with words
replace_number(text)
# Replace abbreviations
replace_abbreviation(text)
# Replace contractions
replace_contraction(text)
# Replace symbols with words
replace_symbol(text)
# List standard English stop words
stopwords("en")
# Print text without standard stop words
removeWords(text,stopwords("en"))
# Add "sleep" and "asleep" to the list: new_stops
new_stops <- c("sleep","asleep",stopwords("en"))
# Remove stop words from text
removeWords(text,new_stops)
Now combine some pre-processing steps into one call:
tolower(
  stripWhitespace(
    removeWords(
      removePunctuation(
        replace_symbol(
          replace_contraction(
            replace_abbreviation(
              bracketX(text)
            )
          )
        )
      ),
      stopwords("en")
    )
  )
)
# Create sleep
(sleep <- c("sleepful","sleeps","sleeping"))
# Perform word stemming: stem_doc
(stem_doc <- stemDocument(sleep))
# Create the completion dictionary: sleep_dict
sleep_dict <- c("sleep")
# Perform stem completion: complete_text
complete_text <- stemCompletion(stem_doc,sleep_dict)
# Print complete_text
complete_text
(text_data <- "In sleepful nights, Katia sleeps to achieve sleeping.")
(comp_dict <- c("In","sleep","nights","Katia","to","achieve"))
# Remove punctuation: rm_punc
rm_punc <- removePunctuation(text_data)
# Create character vector: n_char_vec
n_char_vec <- unlist(strsplit(rm_punc, split = ' '))
# Perform word stemming: stem_doc
stem_doc <- stemDocument(n_char_vec)
# Print stem_doc
stem_doc
# Re-complete stemmed document: complete_doc
complete_doc <- stemCompletion(stem_doc,comp_dict)
# Print complete_doc
complete_doc
# Create a function to clean the corpus, mixing tm and qdap functions
clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
  corpus <- tm_map(corpus, content_transformer(tolower)) # lowercase early, so stop word removal and stemming see consistent input
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
# Apply your customized function to the AP.recaps.corpus: clean_corp.AP.recaps
clean_corp.AP.recaps <- clean_corpus(AP.recaps.corpus)
# Print out a cleaned up recap
clean_corp.AP.recaps[[15]][1]
# Print out the same recap in original form
recaps$AP.Recap[15]
One thing to keep in mind: there is no secret pre-processing formula that will work with all corpora. Context is king/queen.
Let's revisit the first text we looked at:
# Find the 20 most frequent terms: term_count
term_count <- freq_terms(content(clean_corp.AP.recaps[[56]]),20)
# Plot term_count
plot(term_count)
# Create the dtm from clean_corp.AP.recaps: AP.recaps_dtm
AP.recaps_dtm <- DocumentTermMatrix(clean_corp.AP.recaps)
# Print out AP.recaps_dtm data
AP.recaps_dtm
# Convert AP.recaps_dtm to a matrix: AP.recaps_m
AP.recaps_m <- as.matrix(AP.recaps_dtm)
# Print the dimensions of AP.recaps_m
dim(AP.recaps_m)
# Review a portion of the matrix
AP.recaps_m[79:84, 1005:1010]
# Create a TDM from clean_corp.AP.recaps: AP.recaps_tdm
AP.recaps_tdm <- TermDocumentMatrix(clean_corp.AP.recaps)
# Print AP.recaps_tdm data
AP.recaps_tdm
# Convert AP.recaps_tdm to a matrix: AP.recaps_m
AP.recaps_m <- as.matrix(AP.recaps_tdm)
# Print the dimensions of the matrix
dim(AP.recaps_m)
# Review a portion of the matrix
AP.recaps_m[1005:1010, 79:84]
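As the dimensions suggest, a TDM built from even a modest corpus is mostly zeros. If the sparsity becomes unwieldy, tm's removeSparseTerms() drops terms that appear in too few documents. A sketch; the 0.90 threshold is an arbitrary choice here, keeping terms present in at least roughly 10% of the recaps:

```r
# Keep only terms that appear in at least ~10% of the recaps
AP.recaps_tdm_small <- removeSparseTerms(AP.recaps_tdm, sparse = 0.90)
AP.recaps_tdm_small
```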
# Calculate the rowSums: term_frequency
term_frequency <- rowSums(AP.recaps_m)
# Sort term_frequency in descending order
term_frequency <- sort(term_frequency, decreasing=TRUE)
# View the top 20 most common words
term_frequency[1:20]
# Plot a barchart of the 20 most common words
barplot(term_frequency[1:20], col = "tan", las = 2)
# Load wordcloud package
library('wordcloud')
# Print the first 20 entries in term_frequency
term_frequency[1:20]
# Create word_freqs, a data frame with one row per term
word_freqs <- data.frame(term = names(term_frequency), num = term_frequency, stringsAsFactors = FALSE)
# Create a wordcloud for the values in word_freqs
wordcloud(word_freqs$term, word_freqs$num, max.words=100, colors="red")
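A single colour makes frequency differences hard to read in the cloud. wordcloud is commonly paired with a Brewer palette from the RColorBrewer package, so that colour varies with frequency. A sketch, assuming RColorBrewer is installed:

```r
library('RColorBrewer')
# Use a Brewer palette so colour varies with word frequency
wordcloud(word_freqs$term, word_freqs$num, max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```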
# Alter the cleaning function to also drop recap-specific words
clean_corpus_Sens <- function(corpus){
  corpus <- tm_map(corpus, content_transformer(replace_abbreviation))
  corpus <- tm_map(corpus, content_transformer(tolower)) # lowercase early, so the custom stop words below can match
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, c(stopwords("en"), "game", "first", "second", "third", "ottawa", "senators")) # lowercase forms, since the text is lowercased by now; the recaps are all about the Sens, so those words would dominate
  corpus <- tm_map(corpus, stemDocument)
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
# Apply your customized function to the AP.recaps.corpus: clean_corp2.AP.recaps
clean_corp2.AP.recaps <- clean_corpus_Sens(AP.recaps.corpus)
# Create a TDM from clean_corp2.AP.recaps: AP.recaps2_tdm
AP.recaps2_tdm <- TermDocumentMatrix(clean_corp2.AP.recaps)
# Convert AP.recaps2_tdm to a matrix: AP.recaps2_m
AP.recaps2_m <- as.matrix(AP.recaps2_tdm)
# Calculate the rowSums: term_frequency2
term_frequency2 <- rowSums(AP.recaps2_m)
# Sort term_frequency2 in descending order
term_frequency2 <- sort(term_frequency2, decreasing=TRUE)
# Print the first 20 entries in term_frequency2
term_frequency2[1:20]
# Plot a barchart of the 20 most common words
barplot(term_frequency2[1:20], col = "tan", las = 2)
# Create word_freqs2, a data frame with one row per term
word_freqs2 <- data.frame(term = names(term_frequency2), num = term_frequency2, stringsAsFactors = FALSE)
# Create a wordcloud for the values in word_freqs2
wordcloud(word_freqs2$term, word_freqs2$num, max.words=100, colors="red")
We can also look at how often the Senators players and their coach appear in these recaps.
# Senators players and coach surnames
keep=c("anderson","borowiecki","boucher","brassard","burrows","ceci","chabot","chiasson","claesson","condon","didomenico","drieger","hammond","hoffman","jokipakka","karlsson","lazar","macarthur","mccormick","methot","moore","pageau","phaneuf","puempel","pyatt","ryan","ryans","smith","stalberg","stone","white","wideman","wingels")
# Only keep the Senators surnames
word_freqs3 = word_freqs2[word_freqs2$term %in% keep, ]
# Plot a barchart of the Senators players and coach
barplot(term_frequency2[word_freqs2$term %in% keep], col = "tan", las = 2)
# Create a wordcloud for the values in word_freqs3
wordcloud(word_freqs3$term, word_freqs3$num, max.words=100, colors="red")
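Beyond raw counts, tm's findAssocs() reports which terms co-occur with a given term across documents, which can hint at the context in which a player is mentioned. A sketch, looking at terms associated with "stone" (the 0.3 correlation threshold is an arbitrary choice):

```r
# Terms whose document-level frequencies correlate (>= 0.3) with "stone"
findAssocs(AP.recaps2_tdm, "stone", 0.3)
```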
Can we conclude anything about the Senators season from these graphs?
As an exercise, you could repeat the analysis with the other text columns of the dataset: SSS_Recap, OPP_Recap, AP_Headline, SSS_Headline and OPP_Title.